HaiBiostat: Matching in R

Hai Nguyen

library(DOS) # Design of Observational Studies
library(tidyverse)
library(kableExtra)
library(optmatch)

Multivariate matching is involved in 3 steps:
(i) creating a distance matrix,
(ii) adding a propensity score caliper to the distance matrix, and
(iii) finding an optimal match.

Matching in R created largely due to the efforts of Ben Hansen (B. B. Hansen 2007), who created an R function, fullmatch, to do optimal matching from Fortan code created by Demetri Bertsekas D. P. Bertsekas (1981).

Research questions

Does financial aid increase college attendance?

Goal

Four matched samples will be constructed: The construction of one of the four matched samples is presented in step-by-step detail.

Context

During 1979-1981 Social Security Student Benefit Program provided a substantial tuition benefit for a child of a deceased father, before the cancellation in 1982.
Analyzing the impact on college attendance and completed schooling of the elimination of the Social Security Student Benefit Program in 1982
From 1965 to 1982, the Social Security Administration paid for millions of students to go to college. Under this program, the 18- to 22-year-old children of deceased, disabled, or retired Social Security beneficiaries received monthly payments while enrolled full time in college.
In 1981, Congress voted to eliminate the program. Then was cancelled in 1982
The program’s demise provides an opportunity to measure the incentive effects of financial aid. Using difference-in-differences methodology, and proxying for benefit eligibility with the death of a parent during an individual’s childhood, found that ? the elimination of the Social Security student benefit program reduced college attendance probabilities

Data

Data Xb with 2820 rows and 8 columns:
- faminc: family income in units of $10,000
- incmiss: income missing (1 if family income is missing, 0 o.w.)
- black: 1 if black, 0 o.w.
- hispanic: 1 if hispanic, 0 o.w.
- afqtpct: the Armed Forces Qualifications Test
- edm: mother’s education (1 for less than high school, 2 for high school, 3 for some college, 4 for BA degree or more)
- edmissm: mother’s education missing (1 if missing, 0 o.w.)
- female: gender (1 for female, 0 for male)

Imputation: faminc is set to 2 when incmiss = 1 indicating that faminc is missing, and edm is set to 0 when edmissm=1 indicating that edm is missing

AFQT was missing for less than 2% of subjects, and these subjects are not used in the matching. With this restriction, there are 131 high school seniors with deceased fathers and 2689 other high school seniors in the 1979-1981 cohort (131+2689 = 2820)

The treatment zb, where is 1 if the father is deceased, and 0 o.w.

data("dynarski")
dim(dynarski)

[1] 2820   10

kable(head(dynarski, n=20), caption="First 20 people in the 1979-1981 cohort, the treatment zb and the eight covariates Xb") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 1: First 20 people in the 1979-1981 cohort, the treatment zb and the eight covariates Xb
id	zb	faminc	incmiss	black	afqtpct	edm	female
1	0	5.3497560	0	0	71.901660	2	0
2	0	2.4627060	0	0	95.158050	2	1
4	0	9.4876030	0	0	95.785250	2	1
5	0	0.4388592	0	0	90.617160	2	1
6	0	10.4363600	0	0	81.761160	2	0
7	0	6.2694180	0	0	97.917710	4	0
10	1	3.2204620	0	0	61.916710	2	1
11	0	9.4719470	0	0	20.446560	2	1
12	0	2.7167480	0	0	57.175110	1	0
14	0	2.0000000	1	0	8.278976	1	1
15	0	7.1157020	0	0	50.928250	2	1
16	0	7.7669970	0	0	71.149020	2	1
17	0	12.5388400	0	0	74.912190	2	0
18	0	2.0000000	1	0	98.394380	2	1
21	0	9.4041260	0	0	98.720520	4	1
23	0	2.0000000	1	0	99.724040	4	0
24	0	2.0000000	1	0	73.281490	3	1
29	0	2.0000000	1	1	1.781234	1	0
31	0	2.0000000	1	0	42.900150	2	1
34	1	2.6090910	0	0	99.623680	2	0

#write.csv(dynarski, file="dynarski.csv")

zb<-dynarski$zb
zbf<-factor(zb,levels=c(1,0),labels=c("Father deceased","Father not deceased"))
table(zbf)

zbf
    Father deceased Father not deceased 
                131                2689

Xb<-dynarski[,3:10]

$\Rightarrow$ The 131 seniors with deceased fathers will each be matched to 10 controls whose fathers were not deceased.

Covariates with Missing Values

with(dynarski, table(incmiss))

incmiss
   0    1 
2282  538

incmiss.per<-round(table(dynarski$incmiss)[2]/dim(dynarski)[1]*100,1)
paste0("Percentage of income missing: ", incmiss.per)

[1] "Percentage of income missing: 19.1"

with(dynarski, table(edmissm))

edmissm
   0    1 
2704  116

edmissm.per<-round(table(dynarski$edmissm)[2]/dim(dynarski)[1]*100,1)
paste0("Percentage of mother's education missing: ", edmissm.per)

[1] "Percentage of mother's education missing: 4.1"

Discussion of the missing covariate values will be later on another post.

Propensity Score

The propensity score is estimated using a logit model.

# Estimate the propensity score
p <- glm(zb ~ Xb$faminc + Xb$incmiss + Xb$black + Xb$hisp + Xb$afqtpct + Xb$edmissm + Xb$edm + Xb$female, family = binomial)$fitted.values

$\Rightarrow$ The vector p contains the 2820 estimated propensity scores, $\hat{e}(\mathbf{x}_l)$, $l$ = 1, 2, …, 2820.

It is reasonable to be able to improve the model: perhaps including interaction terms or transformations or polynomials or whatnot.

Estimated propensity scores for 131 seniors with deceased fathers and 2689 other seniors in the 1979-1981 cohort, before the Social Security Student Benefit Program was eliminated.

boxplot(p~zbf, ylab="Propensity score", main="1979-1981 Cohort")

dynarski$p <- p
ggplot(dynarski, aes(p, fill = zbf)) + 
  geom_histogram(alpha = 0.5, aes(y = ..density..), position = 'identity') +
  ggtitle("Figure of Overlay Histograms of PS in 2 Groups") +
  theme_bw()

$\Longrightarrow$ the median $\hat{e}(\mathbf{x})$ in the treated group (0.064) is about equal to the upper quartile among potential controls (0.063). Nonetheless, the distributions in Figure of Overlay Histogram of PS in 2 Groups overlap substantially, so matching appears to be possible.

Distance Matrix

Constructing in 2 steps:
- Step 1: computing the rankbased Mahalanobis distance $\rightarrow$ 131x2689 distance matrix

#Robust Mahalanobis distance matrix, treated x control
dmat<-smahal(zb,Xb)
dim(dmat)

[1]  131 2689

kable(round(dmat[1:5,1:5],2), caption="First five rows and columns of the 131×2689 distance matrix using the rank-based Mahalanobis distance") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 2: First five rows and columns of the 131×2689 distance matrix using the rank-based Mahalanobis distance
	1	2	3	4	5
7	3.86	2.61	5.06	6.86	6.72
20	3.47	3.03	7.58	6.23	5.82
108	9.60	19.47	20.02	24.62	13.03
126	6.81	8.05	12.93	10.74	9.88
145	8.70	15.09	17.74	18.86	12.37

Step 2: adding the caliper on the propensity score. The caliper was at $0.2 \times sd(\hat{e}(\mathbf{x}))$.

#Add a caliper on the propensity score using a penalty function
dmat<-addcaliper(dmat,zb,p,caliper=.2)
dim(dmat)

[1]  131 2689

kable(round(dmat[1:5,1:5],2), caption="First five rows and columns of the 131×2689 distance matrix after adding the propensity score calipers") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 3: First five rows and columns of the 131×2689 distance matrix after adding the propensity score calipers
	1	2	3	4	5
7	18.60	20.64	42.04	79.91	46.10
20	46.32	3.03	72.66	51.18	73.30
108	82.94	47.40	115.60	39.07	111.01
126	57.81	13.64	86.16	47.54	85.51
145	8.70	54.51	33.32	113.31	30.34

$\Rightarrow$ Among these 25 entries, only 2 respected the caliper and did not incur a penalty (3.03 and 8.70).

Constructing the Match (B. B. Hansen 2007)

Matching ten controls to each senior with a deceased father: The fullmatch function needs to know the distance matrix, here dmat, matching 10-to-1 means 10x131 will be used, omitting 2689−10x131 = 1379, which is the omit fraction be: 51%.

m<-fullmatch(dmat, data=dynarski, min.controls=10, max.controls=10, omit.fraction=1379/2689)

There is an entry in m for each of the 2820 seniors

length(m)

[1] 2820

The first ten entries in m are

m[1:10]

    1     2     3     4     5     6     7     8     9    10 
1.129 1.100  <NA>  1.54  <NA> 1.121 1.114  <NA> 1.111  1.87

i.e. the first senior of the 2820 seniors is in matched set #34 and the second senior is in matched set #10. The third senior was one of the 1379 unmatched controls; this is the meaning of the zero in m.01. The fourth senior is in matched set #87, the fifth is unmatched, and so on.

The function matched(·) indicates who is matched. The first ten entries of matched(m) are

matched(m)[1:10]

    1     2     3     4     5     6     7     8     9    10 
 TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE

the first two seniors were matched but the third was not, and so on.

There are 1441 matched seniors, where 1441 = 131×11, because the 131 seniors with deceased fathers were each matched to ten controls, making 131 matched sets each of size 11.

sum(matched(m))

[1] 1441

sum(matched(m))/11

[1] 131

The first 3 matched sets are in Table below. The first match consists of 11 female high school seniors, neither black nor hispanic, whose mothers had a high school education, with family incomes between $30,000 and $40,000, mostly with test scores between 59% and 77%. In the second matched set, incomes were lower but test scores were higher. And so on.

# Housekeeping
im<-as.integer(m) # matched set from 1 to 131
dynarski<-cbind(dynarski,im) 
dm<-dynarski[matched(m),] # only select matched cases, note that matched(m): T\F
dm<-dm[order(dm$im,1-dm$zb),] # sort data according to matched set then zb decreasing

# Table of matched set example
#which(dm$id==10) # [1] 188
#which(dm$id==396) # [1] 23
#which(dm$id==3051) # [1] 1068

kable(rbind(dm[188:198,-c(11,12)], dm[23:33,-c(11,12)],dm[1068:1078,-c(11,12)]), caption="The 3 of 131 matched sets, each set containing one treated subject and 10 matched controls") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

(#tab:matched set ex)The 3 of 131 matched sets, each set containing one treated subject and 10 matched controls
	id	zb	faminc	hisp	afqtpct	edm	female
7	10	1	3.220462	0	61.91671	2	1
239	350	0	3.557851	0	59.58355	2	1
252	365	0	3.599340	0	61.23934	2	1
322	465	0	3.557851	0	56.37230	2	1
361	518	0	3.788779	0	67.00953	2	1
383	550	0	3.788779	0	63.49724	2	1
928	1294	0	3.557851	0	67.31059	2	1
1481	2072	0	3.788779	0	64.97742	2	1
1489	2082	0	3.788779	0	63.62268	2	1
1560	2183	0	3.970631	0	76.51781	2	1
2810	3965	0	3.761650	0	72.57903	2	1
268	396	1	2.371901	0	88.50979	2	1
2	2	0	2.462706	0	95.15805	2	1
101	147	0	2.273267	0	77.09483	2	1
375	537	0	2.595314	0	95.96086	2	1
665	933	0	2.846281	0	96.11139	2	1
698	974	0	1.897521	0	99.59859	3	1
707	987	0	2.134711	0	81.18414	2	1
1394	1947	0	2.050298	0	91.44506	3	1
1518	2124	0	2.298786	0	72.40341	2	1
1877	2618	0	2.211560	0	68.91621	2	1
2816	3975	0	2.371901	0	90.74260	2	1
2182	3051	1	3.409901	1	62.87004	1	0
427	606	0	4.179612	1	81.73608	1	0
472	664	0	4.388592	1	91.57050	1	0
628	884	0	2.846281	1	48.77070	1	0
713	995	0	3.134709	1	55.11791	1	0
724	1008	0	3.320661	1	51.60562	1	0
1006	1399	0	3.439256	1	90.01505	2	0
2087	2908	0	3.795041	1	80.28098	1	0
2328	3262	0	3.788779	1	57.27546	1	0
2424	3400	0	2.925728	1	80.88309	2	0
2576	3624	0	3.320661	1	74.08430	1	1

Checking Covariate Balance

Note that: for covariate $k$,
- $\bar{x}_{tk}$ is the mean in the first group in the comparison,
- $\bar{x}_{ck}$ is the mean in the second group in the comparison, and
- $\bar{x}_{mck}$ is the mean in the matched comparison group formed from the second group
- $sd_{bk}$ is the standardized difference in means before matching and
- $sd_{mk}$ is the standardized difference in means after matching; they have the same denominator but different numerators

#fd <- dynarski[dynarski$zb==1,-c(1,2,12)]
#round(apply(fd, 2, FUN=mean),2)

d.tk<-dynarski %>%
  filter(zb==1) %>%
  mutate(faminc=ifelse(incmiss==1,NA, faminc), 
         edm=ifelse(edmissm==1, NA, edm)) %>%
  select(-id,-zb,-im)
x.tk<-d.tk %>% summarise_all(mean, na.rm=TRUE)
s.tk<-d.tk %>% summarise_all(sd, na.rm=TRUE)

d.ck<-dynarski %>%
  filter(zb==0) %>%
  mutate(faminc=ifelse(incmiss==1,NA, faminc), 
         edm=ifelse(edmissm==1, NA, edm)) %>%
  select(-id,-zb,-im)
x.ck<-d.ck %>% summarise_all(mean, na.rm=TRUE)
s.ck<-d.ck %>% summarise_all(sd, na.rm=TRUE)

sd.bk<-abs(x.tk-x.ck)/sqrt((s.tk^2+s.ck^2)/2)

d.cmk<-dm %>%
  filter(zb==0) %>%
  mutate(faminc=ifelse(incmiss==1,NA, faminc), 
         edm=ifelse(edmissm==1, NA, edm)) %>%
  select(-id,-zb,-im) 
x.cmk<-d.cmk %>% summarise_all(mean, na.rm=TRUE)
sd.cmk<-abs(x.tk-x.cmk)/sqrt((s.tk^2+s.ck^2)/2)

set<-round(rbind(x.tk,x.cmk,x.ck,sd.bk,sd.cmk),2)
row.names(set) <- c("x.tk","x.cmk","x.ck","sd.bk","sd.cmk")

kable(set, caption="Balance on covariates before and after matching") %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

Table 4: Balance on covariates before and after matching
	faminc	incmiss	black	hisp	afqtpct	edmissm	edm	female	p
x.tk	2.78	0.15	0.35	0.15	49.58	0.08	1.62	0.49	0.07
x.cmk	2.77	0.15	0.34	0.15	49.10	0.04	1.61	0.50	0.06
x.ck	4.58	0.19	0.29	0.15	52.39	0.04	1.91	0.50	0.05
sd.bk	0.71	0.11	0.13	0.02	0.10	0.19	0.33	0.03	0.67
sd.cmk	0.00	0.00	0.02	0.03	0.02	0.19	0.02	0.02	0.09

$\Longrightarrow$ They were quite different before matching but were much closer after matching. The family income of seniors with deceased fathers was lower, they were more often black, and their mothers had less education. Between 1979-1981, AFQT test scores decline. The imbalances are much smaller after matching.

Outcomes

The outcomes were not provided in the Dynarski’s dataset. But I guessed we can report the proportions and MH-odds ratio between 2 groups after matching.

In short, during 1979-1981, when the Social Security Student Benefit Program provided tuition aid to students of a deceased Social Security beneficiary, seniors with deceased fathers were more likely to attend college and complete one year of college than were matched controls, with an odds ratio of about 1.65, but there is no sign of this in 1982-1983 after the program ended (the later can be checked because of lack the data for analysis).

Standardized differences in covariate means before and after matching in matched comparisons. The boxplot displays standardized differences in means, for the nine covariates and the propensity score.

m<-c(0,1)
d.sd<-rbind(sd.bk,sd.cmk)
d.sd<-cbind(d.sd,m)
d.sd.long<-d.sd %>%
  pivot_longer(-m,names_to="absstdzdiff", values_to="sd")
d.sd.long$m<-factor(d.sd.long$m,levels=c(0,1),labels=c("Unmatched","Matched"))
boxplot(sd~m, data=d.sd.long, xlab="", ylab="Absolute Standardized Difference", main="1979-1981 Cohort: FD vs. FND")

Matching in R

Research questions

Goal

Context

Data

Covariates with Missing Values

Propensity Score

Distance Matrix

Constructing the Match (B. B. Hansen 2007)

Checking Covariate Balance

Outcomes

Further reading

References

Corrections

Reuse

Citation